 pruning rate


Efficient Multi-bit Quantization Network Training via Weight Bias Correction and Bit-wise Coreset Sampling

Neural Information Processing Systems

Multi-bit quantization networks enable flexible deployment of deep neural networks by supporting multiple precision levels within a single model. However, existing approaches suffer from significant training overhead as full-dataset updates are repeated for each supported bit-width, resulting in a cost that scales linearly with the number of precisions. Additionally, extra fine-tuning stages are often required to support additional or intermediate precision options, further compounding the overall training burden. To address this issue, we propose two techniques that greatly reduce the training overhead without compromising model utility: (i) weight bias correction enables shared batch normalization and eliminates the need for fine-tuning by neutralizing quantization-induced bias across bit-widths and aligning activation distributions; and (ii) a bit-wise coreset sampling strategy allows each child model to train on a compact, informative subset selected via gradient-based importance scores by exploiting the implicit knowledge transfer phenomenon. Experiments on CIFAR-10/100, TinyImageNet, and ImageNet-1K with both ResNet and ViT architectures demonstrate that our method achieves competitive or superior accuracy while reducing training time by up to 7.88×. Our code is released at this link.



Experimental Results of Pruning Plasticity

Neural Information Processing Systems

We also studied pruning plasticity on structured pruning. In particular, we choose the filter pruning method used in Li et al. [32]. The pruning criterion is the absolute weight sum of each nonzero filter, and the regeneration criterion is the absolute gradient sum of each zero filter. We first pre-train four sets of neural networks from scratch with various levels of structured sparsity, including 0, 0.10, 0.50, and 0.70, denoted as "Pre-trained Sparsity" in the figure title. To measure the plasticity of these pre-trained models, we choose four different pruning rates, denoted as "Pruning rate", to remove filters from these pre-trained models.
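The two criteria in this abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names, the flattened-list representation of a filter, and the choice to apply the pruning rate to the currently nonzero filters only are all assumptions made for illustration.

```python
from typing import List, Sequence

Filter = Sequence[float]  # one convolution filter, flattened (assumed layout)


def abs_sum(values: Filter) -> float:
    """Sum of absolute values over a flattened filter."""
    return sum(abs(v) for v in values)


def is_zero_filter(f: Filter) -> bool:
    """A filter counts as pruned when every weight is exactly zero."""
    return all(v == 0.0 for v in f)


def select_filters_to_prune(weights: List[Filter], rate: float) -> List[int]:
    """Pick the nonzero filters with the smallest absolute weight sums.

    Assumption: `rate` is a fraction of the currently nonzero filters.
    """
    alive = [i for i, f in enumerate(weights) if not is_zero_filter(f)]
    k = int(rate * len(alive))
    ranked = sorted(alive, key=lambda i: abs_sum(weights[i]))
    return sorted(ranked[:k])


def select_filters_to_regrow(
    weights: List[Filter], grads: List[Filter], count: int
) -> List[int]:
    """Pick the zero filters with the largest absolute gradient sums."""
    dead = [i for i, f in enumerate(weights) if is_zero_filter(f)]
    ranked = sorted(dead, key=lambda i: abs_sum(grads[i]), reverse=True)
    return sorted(ranked[:count])


# Toy layer with five filters of two weights each (hypothetical values).
weights = [[0.5, -0.5], [0.0, 0.0], [0.1, 0.1], [2.0, -1.0], [0.0, 0.0]]
grads = [[0.1, 0.1], [0.3, -0.4], [0.2, 0.0], [0.0, 0.1], [0.05, 0.05]]

to_prune = select_filters_to_prune(weights, rate=0.5)
to_regrow = select_filters_to_regrow(weights, grads, count=1)
```

In this toy layer, filter 2 has the smallest absolute weight sum among the three nonzero filters, so it is the one selected for pruning; among the two zero filters, filter 1 has the larger absolute gradient sum, so it is the one selected for regeneration.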





UniQL: Unified Quantization and Low-rank Compression for Adaptive Edge LLMs

arXiv.org Artificial Intelligence

Deploying large language models (LLMs) on mobile platforms faces significant challenges due to the limited memory and shared computational resources of the device. Resource availability may be an issue as it is directly impacted by the current device workload, adding to the uncertainty of model deployment. We introduce UniQL, a unified post-training quantization and low-rank compression framework with on-device configurable pruning rates for edge LLMs. UniQL is a general framework that integrates quantization and low-rank compression for Transformers, State Space Models (SSMs), and hybrid models to support diverse edge applications. In our proposed joint framework, we introduce an efficient structured weight-sorting method that speeds up computation by 20x, quantization-aware singular value decomposition (SVD) to minimize quantization errors, state-aware weight sorting for SSMs, and a fused rotary positional embedding (RoPE) kernel for pruned models. Our framework performs weight-sorting, fine-tuning, and quantization in the cloud in a single-pass workflow, while enabling on-device configurable pruning rates up to 35%. Our experiments show that quantized and pruned models achieve a memory reduction of 4x-5.7x and a token-throughput improvement of 2.7x-3.4x, maintaining accuracy within 5% of the original models at 15% pruning across Transformers (Llama3 and Qwen2.5), SSMs (Mamba2), and hybrid models (Nemotron-H and Bamba-v2). The code and quantized models are available at: https://github.com/enyac-group/UniQL.


FAIR-Pruner: Leveraging Tolerance of Difference for Flexible Automatic Layer-Wise Neural Network Pruning

arXiv.org Artificial Intelligence

Neural network pruning has been widely adopted to reduce the parameter scale of complex neural networks, enabling efficient deployment on resource-limited edge devices. Mainstream pruning methods typically adopt uniform pruning strategies, which tend to cause substantial performance degradation at high sparsity levels. Recent studies focus on non-uniform layer-wise pruning, but such approaches typically depend on global architecture optimization, which is computationally expensive and lacks flexibility. To address these limitations, this paper proposes a novel method named Flexible Automatic Identification and Removal (FAIR)-Pruner, which adaptively determines the sparsity level of each layer and identifies the units to be pruned. The core of FAIR-Pruner lies in the introduction of a novel indicator, Tolerance of Differences (ToD), designed to balance the importance scores obtained from two complementary perspectives: the architecture-level (Utilization Score) and the task-level (Reconstruction Score). By controlling ToD at preset levels, FAIR-Pruner determines layer-specific thresholds and removes units whose Utilization Scores fall below the corresponding thresholds. Furthermore, by decoupling threshold determination from importance estimation, FAIR-Pruner allows users to flexibly obtain pruned models under varying pruning ratios. Extensive experiments demonstrate that FAIR-Pruner achieves state-of-the-art performance, maintaining higher accuracy even at high compression ratios. Moreover, the ToD-based layer-wise pruning ratios can be directly applied to existing powerful importance measurements, thereby improving on their performance under uniform pruning.


TETRIS: TilE-matching the TRemendous Irregular Sparsity

Neural Information Processing Systems

Compressing neural networks by pruning weights with small magnitudes can significantly reduce the computation and storage cost. Although pruning makes the model smaller, it is difficult to obtain a practical speedup on modern computing platforms such as CPUs and GPUs because the resulting sparsity pattern is irregular.
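The irregularity mentioned here can be seen in a small sketch. The code below is plain magnitude pruning followed by a tile count, not the TETRIS method itself; the function names, the 2x2 tile size, and the weight values are assumptions chosen for illustration.

```python
from typing import List

Matrix = List[List[float]]


def magnitude_prune(matrix: Matrix, sparsity: float) -> Matrix:
    """Zero out the `sparsity` fraction of weights with the smallest magnitude.

    Ties at the threshold are all pruned, so slightly more than the
    requested fraction may be removed.
    """
    magnitudes = sorted(abs(w) for row in matrix for w in row)
    k = int(sparsity * len(magnitudes))
    if k == 0:
        return [list(row) for row in matrix]
    threshold = magnitudes[k - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in matrix]


def count_empty_tiles(matrix: Matrix, tile_h: int, tile_w: int) -> int:
    """Count tiles whose entries are all zero, i.e. tiles that could be skipped."""
    rows, cols = len(matrix), len(matrix[0])
    empty = 0
    for r in range(0, rows, tile_h):
        for c in range(0, cols, tile_w):
            tile = [
                matrix[i][j]
                for i in range(r, min(r + tile_h, rows))
                for j in range(c, min(c + tile_w, cols))
            ]
            if all(v == 0.0 for v in tile):
                empty += 1
    return empty


# Hypothetical 2x4 weight matrix pruned to 50% sparsity.
dense = [[0.9, -0.1, 0.4, 0.05], [0.02, 0.7, -0.3, 0.6]]
pruned = magnitude_prune(dense, sparsity=0.5)
skippable = count_empty_tiles(pruned, tile_h=2, tile_w=2)
```

Half of the weights are removed, yet the zeros are scattered so that neither 2x2 tile is entirely zero. Hardware that processes weights in dense tiles therefore cannot skip any tile in this example, which is one way to see why unstructured sparsity alone rarely translates into a practical speedup.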


Discrimination-aware Channel Pruning for Deep Neural Networks

Neural Information Processing Systems

Channel pruning is one of the predominant approaches for deep model compression. Existing pruning methods either train from scratch with sparsity constraints on channels, or minimize the reconstruction error between the pre-trained feature maps and the compressed ones. Both strategies suffer from some limitations: the former kind is computationally expensive and difficult to converge, whilst the latter kind optimizes the reconstruction error but ignores the discriminative power of channels.